Music streaming services have transformed the music business by giving listeners access to vast song catalogues. Spotify has emerged as the leading platform in the global streaming market, offering millions of tracks and holding a substantial market share. Its large user base and rich dataset make it well suited to analysing and forecasting track popularity. Fans, record labels, and musicians all want to understand what makes a song popular. Accurately predicting a song’s popularity can reveal audience tastes, inform marketing strategy, and shed light on how the music business works. This study aims to build a predictive model of track popularity using Spotify’s music dataset.

This study’s research question is whether track popularity can be predicted from Spotify’s dataset. The question has been investigated before: several studies have examined the relationship between track popularity, contextual factors, and audio features. Some have linked audio features such as tempo, energy, and danceability to a track’s popularity. Others have shown that contextual factors, such as playlist inclusion, album release patterns, and artist popularity, also influence it. Much remains unknown, however, particularly about how audio and contextual information combine in a single prediction model, and the relative weight and interactions of these factors are still open questions.

if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
if (!require("scales")) install.packages("scales")
## Loading required package: scales
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(scales)
if (!require("caret")) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(caret)  
if (!require("viridis")) install.packages("viridis")
## Loading required package: viridis
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## 
## The following object is masked from 'package:scales':
## 
##     viridis_pal
library(viridis)
if (!require("treemap")) install.packages("treemap")
## Loading required package: treemap
library(treemap)
if (!require("htmltools")) install.packages("htmltools")
## Loading required package: htmltools
library(htmltools)
if (!require("tm")) install.packages("tm")
## Loading required package: tm
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)
if (!require("readr")) install.packages("readr")
library(readr)
if (!require("ggcorrplot")) install.packages("ggcorrplot")
## Loading required package: ggcorrplot
library(ggcorrplot)
if (!require("nnet")) install.packages("nnet")
## Loading required package: nnet
library(nnet)
if (!require("ISLR")) install.packages("ISLR")
## Loading required package: ISLR
library(ISLR) 
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
# Load datasets
Spotify <- read.csv("spotify_songs.csv")
Spotify2 <- read.csv("spotify.csv")

Data Pre-Processing:

str(Spotify)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/19" "12/13/19" "7/5/19" "7/19/19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
str(Spotify2)
## 'data.frame':    953 obs. of  25 variables:
##  $ track_name          : chr  "Seven (feat. Latto) (Explicit Ver.)" "LALA" "vampire" "Cruel Summer" ...
##  $ artist.s._name      : chr  "Latto, Jung Kook" "Myke Towers" "Olivia Rodrigo" "Taylor Swift" ...
##  $ artist_count        : int  2 1 1 1 1 2 2 1 1 2 ...
##  $ released_year       : int  2023 2023 2023 2019 2023 2023 2023 2023 2023 2023 ...
##  $ released_month      : int  7 3 6 8 5 6 3 7 5 3 ...
##  $ released_day        : int  14 23 30 23 18 1 16 7 15 17 ...
##  $ in_spotify_playlists: int  553 1474 1397 7858 3133 2186 3090 714 1096 2953 ...
##  $ in_spotify_charts   : int  147 48 113 100 50 91 50 43 83 44 ...
##  $ streams             : num  1.41e+08 1.34e+08 1.40e+08 8.01e+08 3.03e+08 ...
##  $ in_apple_playlists  : int  43 48 94 116 84 67 34 25 60 49 ...
##  $ in_apple_charts     : int  263 126 207 207 133 213 222 89 210 110 ...
##  $ in_deezer_playlists : chr  "45" "58" "91" "125" ...
##  $ in_deezer_charts    : int  10 14 14 12 15 17 13 13 11 13 ...
##  $ in_shazam_charts    : int  826 382 949 548 425 946 418 194 953 339 ...
##  $ bpm                 : int  125 92 138 170 144 141 148 100 130 170 ...
##  $ key                 : chr  "B" "C#" "F" "A" ...
##  $ mode                : chr  "Major" "Major" "Major" "Major" ...
##  $ danceability_.      : int  80 71 51 55 65 92 67 67 85 81 ...
##  $ valence_.           : int  89 61 32 58 23 66 83 26 22 56 ...
##  $ energy_.            : int  83 74 53 72 80 58 76 71 62 48 ...
##  $ acousticness_.      : int  31 7 17 11 14 19 48 37 12 21 ...
##  $ instrumentalness_.  : int  0 0 0 0 63 0 0 0 0 0 ...
##  $ liveness_.          : int  8 10 31 11 11 8 8 11 28 8 ...
##  $ speechiness_.       : int  4 4 6 15 6 24 3 4 9 33 ...
##  $ popularity          : num  5.66e+07 5.35e+07 5.60e+07 3.20e+08 1.21e+08 ...

Let us check the dimensions of the two Spotify datasets, then look for missing values and duplicates:

# Output dataset dimensions
cat("Dimensions of Spotify:", dim(Spotify), "\n")
## Dimensions of Spotify: 32833 23
cat("Dimensions of Spotify2:", dim(Spotify2), "\n")
## Dimensions of Spotify2: 953 25
# Identifying missing values across columns
col_miss_Spotify <- colSums(is.na(Spotify))
if (any(col_miss_Spotify > 0)) {
  cat("Missing values in Spotify:", col_miss_Spotify[col_miss_Spotify > 0], "\n")
} else {
  cat("No missing values in Spotify\n")
}
## Missing values in Spotify: 5 5 5
col_miss_Spotify2 <- colSums(is.na(Spotify2))
if (any(col_miss_Spotify2 > 0)) {
  cat("Missing values in Spotify2:", col_miss_Spotify2[col_miss_Spotify2 > 0], "\n")
} else {
  cat("No missing values in Spotify2\n")
}
## Missing values in Spotify2: 1 57 58
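
The counts printed above lose their column names because `cat()` drops them. The following is a minimal sketch, on a small invented data frame (`toy` is hypothetical and not the Spotify data), of how the affected columns could be reported by name. It also counts empty strings, which `is.na()` does not flag and which `read.csv()` leaves as `""` in character columns under its default settings.

```r
# Toy data standing in for the real dataset (all values are invented).
toy <- data.frame(
  track_name = c("A", NA, "C", ""),
  streams    = c(10, 20, NA, NA),
  bpm        = c(120, 98, 140, 101),
  stringsAsFactors = FALSE
)

# Named vector of NA counts per column, keeping only columns with gaps.
na_counts <- colSums(is.na(toy))
na_counts <- na_counts[na_counts > 0]
print(na_counts)

# Empty strings are not NA, so count them separately for character columns.
is_chr <- vapply(toy, is.character, logical(1))
empty_counts <- colSums(toy[, is_chr, drop = FALSE] == "", na.rm = TRUE)
print(empty_counts)
```

Printing the named vector directly, rather than passing it through `cat()`, is what keeps the column names attached to the counts.
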
# Find number of duplicate values
duplicate_obs_Spotify <- duplicated(Spotify)
cat("Number of duplicate observations in Spotify:", sum(duplicate_obs_Spotify), "\n")
## Number of duplicate observations in Spotify: 0
duplicate_obs_Spotify2 <- duplicated(Spotify2)
cat("Number of duplicate observations in Spotify2:", sum(duplicate_obs_Spotify2), "\n")
## Number of duplicate observations in Spotify2: 0
# Check for duplicate track ID
duplicate_id_Spotify <- duplicated(Spotify$track_id)
cat("Number of duplicate track IDs in Spotify:", sum(duplicate_id_Spotify), "\n")
## Number of duplicate track IDs in Spotify: 4477
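
Whole rows are unique, yet 4477 track IDs repeat. One plausible explanation, stated here as an assumption rather than something the output confirms, is that a track is listed once for every playlist it belongs to. If one row per track is wanted, a sketch on invented data (`toy` is hypothetical) could look like this:

```r
library(dplyr)

# Toy data: track "t1" appears in two playlists (values are invented).
toy <- data.frame(
  track_id         = c("t1", "t2", "t1", "t3"),
  playlist_genre   = c("pop", "pop", "edm", "rock"),
  track_popularity = c(66, 70, 66, 45),
  stringsAsFactors = FALSE
)

# Keep the first row seen for each track_id; the other columns come from that row.
toy_unique <- toy %>% distinct(track_id, .keep_all = TRUE)

cat("Rows before:", nrow(toy), " after:", nrow(toy_unique), "\n")
```

Note the trade-off: dropping repeats also drops the extra genre labels, so this suits track-level analysis but not per-playlist or per-genre counts.
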
# Checking summary of numerical variables
Spotify_num <- Spotify %>% select_if(is.numeric)
cat("Summary of numerical variables in Spotify:\n")
## Summary of numerical variables in Spotify:
print(summary(Spotify_num))
##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 24.00   1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000  
##  Median : 45.00   Median :0.6720   Median :0.721000   Median : 6.000  
##  Mean   : 42.48   Mean   :0.6548   Mean   :0.698619   Mean   : 5.374  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :11.000  
##     loudness            mode         speechiness      acousticness   
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: -8.171   1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151  
##  Median : -6.166   Median :1.0000   Median :0.0625   Median :0.0804  
##  Mean   : -6.720   Mean   :0.5657   Mean   :0.1071   Mean   :0.1753  
##  3rd Qu.: -4.645   3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550  
##  Max.   :  1.275   Max.   :1.0000   Max.   :0.9180   Max.   :0.9940  
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96  
##  Median :0.0000161   Median :0.1270   Median :0.5120   Median :121.98  
##  Mean   :0.0847472   Mean   :0.1902   Mean   :0.5106   Mean   :120.88  
##  3rd Qu.:0.0048300   3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92  
##  Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910   Max.   :239.44  
##   duration_ms    
##  Min.   :  4000  
##  1st Qu.:187819  
##  Median :216000  
##  Mean   :225800  
##  3rd Qu.:253585  
##  Max.   :517810
Spotify2_num <- Spotify2 %>% select_if(is.numeric)
cat("Summary of numerical variables in Spotify2:\n")
## Summary of numerical variables in Spotify2:
print(summary(Spotify2_num))
##   artist_count   released_year  released_month    released_day  
##  Min.   :1.000   Min.   :1930   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:1.000   1st Qu.:2020   1st Qu.: 3.000   1st Qu.: 6.00  
##  Median :1.000   Median :2022   Median : 6.000   Median :13.00  
##  Mean   :1.556   Mean   :2018   Mean   : 6.034   Mean   :13.93  
##  3rd Qu.:2.000   3rd Qu.:2022   3rd Qu.: 9.000   3rd Qu.:22.00  
##  Max.   :8.000   Max.   :2023   Max.   :12.000   Max.   :31.00  
##                                                                 
##  in_spotify_playlists in_spotify_charts    streams          in_apple_playlists
##  Min.   :   31        Min.   :  0.00    Min.   :2.762e+03   Min.   :  0.00    
##  1st Qu.:  875        1st Qu.:  0.00    1st Qu.:1.416e+08   1st Qu.: 13.00    
##  Median : 2224        Median :  3.00    Median :2.905e+08   Median : 34.00    
##  Mean   : 5200        Mean   : 12.01    Mean   :5.141e+08   Mean   : 67.81    
##  3rd Qu.: 5542        3rd Qu.: 16.00    3rd Qu.:6.739e+08   3rd Qu.: 88.00    
##  Max.   :52898        Max.   :147.00    Max.   :3.704e+09   Max.   :672.00    
##                                         NA's   :1                             
##  in_apple_charts  in_deezer_charts in_shazam_charts      bpm       
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   : 65.0  
##  1st Qu.:  7.00   1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:100.0  
##  Median : 38.00   Median : 0.000   Median :  2.00   Median :121.0  
##  Mean   : 51.91   Mean   : 2.666   Mean   : 51.18   Mean   :122.5  
##  3rd Qu.: 87.00   3rd Qu.: 2.000   3rd Qu.: 36.00   3rd Qu.:140.0  
##  Max.   :275.00   Max.   :58.000   Max.   :953.00   Max.   :206.0  
##                                    NA's   :57                      
##  danceability_.    valence_.        energy_.     acousticness_. 
##  Min.   :23.00   Min.   : 4.00   Min.   : 9.00   Min.   : 0.00  
##  1st Qu.:57.00   1st Qu.:32.00   1st Qu.:53.00   1st Qu.: 6.00  
##  Median :69.00   Median :51.00   Median :66.00   Median :18.00  
##  Mean   :66.97   Mean   :51.43   Mean   :64.28   Mean   :27.06  
##  3rd Qu.:78.00   3rd Qu.:70.00   3rd Qu.:77.00   3rd Qu.:43.00  
##  Max.   :96.00   Max.   :97.00   Max.   :97.00   Max.   :97.00  
##                                                                 
##  instrumentalness_.   liveness_.    speechiness_.     popularity       
##  Min.   : 0.000     Min.   : 3.00   Min.   : 2.00   Min.   :1.346e+03  
##  1st Qu.: 0.000     1st Qu.:10.00   1st Qu.: 4.00   1st Qu.:5.484e+07  
##  Median : 0.000     Median :12.00   Median : 6.00   Median :1.090e+08  
##  Mean   : 1.581     Mean   :18.21   Mean   :10.13   Mean   :1.882e+08  
##  3rd Qu.: 0.000     3rd Qu.:24.00   3rd Qu.:11.00   3rd Qu.:2.402e+08  
##  Max.   :91.000     Max.   :97.00   Max.   :64.00   Max.   :1.425e+09  
##                                                     NA's   :58
Below is the data dictionary for the variables used in this study. Track popularity is the outcome variable to be predicted; it measures a song’s relative popularity within the Spotify ecosystem. The variables are:

track_name - Song name
track_popularity - Song popularity (0-100), where higher means more popular
playlist_genre - Playlist genre
danceability - Describes how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
energy - A measure from 0.0 to 1.0 representing perceived intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
key - The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation.
loudness - The overall loudness of a track in decibels (dB). Values are averaged across the entire track and are useful for comparing the relative loudness of tracks.
mode - The modality (major or minor) of a track, that is, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
speechiness - Detects the presence of spoken words in a track.
acousticness - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic.
instrumentalness - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context.
liveness - Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live.
valence - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
tempo - The overall estimated tempo of a track in beats per minute (BPM).

duration_ms - Duration of song in milliseconds
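
To make the `key` and `mode` encodings concrete, here is a small sketch that decodes them into readable labels. The helper `key_to_name()` and the vector `pitch_classes` are illustrative names introduced here, not part of the dataset or of any package; the mapping itself is the standard pitch-class numbering (0 = C, 1 = C#/Db, up to 11 = B).

```r
# Pitch-class names indexed by key value 0..11 (0 = C, 1 = C#/Db, ..., 11 = B).
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

key_to_name <- function(key) {
  # R vectors are 1-indexed, so shift by one; keys outside 0..11 give NA.
  key <- ifelse(key >= 0 & key <= 11, key, NA)
  pitch_classes[key + 1]
}

# Keys and modes of the first three tracks shown in the str(Spotify) output.
keys  <- c(6, 11, 1)
modes <- c(1, 1, 0)
print(paste(key_to_name(keys), ifelse(modes == 1, "major", "minor")))

# duration_ms converted to minutes for the first track (194754 ms).
print(round(194754 / 60000, 2))
```
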

Exploratory Data Analysis (EDA):

Let us start with the popularity analysis. For this study, I classify the track popularity attribute into three classes: low, medium, and high. As the data dictionary notes, track popularity is a value between 0 and 100. The classes are defined as follows:

high - track popularity greater than 75
medium - track popularity greater than 30 and up to 75
low - track popularity of 30 or less

# Popularity Analysis
Spotify <- Spotify %>%
  mutate(popularity = case_when(
    track_popularity <= 30 ~ "low",
    track_popularity > 30 & track_popularity <= 75 ~ "medium",
    track_popularity > 75 ~ "high"
  ))
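
As an aside, the same three bands can be expressed with base R’s `cut()`, which makes the interval boundaries explicit and returns a factor with the levels in a meaningful order. This is an alternative sketch on a handful of boundary values, not the code used in the rest of the analysis:

```r
# Boundary values chosen to exercise each edge of the three bands.
track_popularity <- c(0, 30, 31, 75, 76, 100)

# Right-closed intervals: (-Inf,30] -> low, (30,75] -> medium, (75,Inf) -> high.
popularity <- cut(track_popularity,
                  breaks = c(-Inf, 30, 75, Inf),
                  labels = c("low", "medium", "high"),
                  right = TRUE)

print(data.frame(track_popularity, popularity))
print(table(popularity))
```

A factor ordered low, medium, high also keeps plots and tables in that order instead of sorting the classes alphabetically.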

# Top tracks in the dataset
popular_track <- Spotify %>%
  filter(popularity == "high") %>%
  arrange(desc(track_popularity)) %>%
  distinct(track_name, track_popularity)

cat("Top tracks in Spotify with high popularity:\n")
## Top tracks in Spotify with high popularity:
print(head(popular_track, 10))
##             track_name track_popularity
## 1         Dance Monkey              100
## 2              ROXANNE               99
## 3                 Tusa               98
## 4             Memories               98
## 5      Blinding Lights               98
## 6              Circles               98
## 7              The Box               98
## 8  everything i wanted               97
## 9      Don't Start Now               97
## 10             Falling               97
# Create a summary of top artists within each playlist genre
artist_genre <- Spotify %>%
  dplyr::select(playlist_genre, track_artist, track_popularity) %>%
  group_by(playlist_genre, track_artist) %>%
  summarise(n = n()) %>%
  top_n(10, n)
## `summarise()` has grouped output by 'playlist_genre'. You can override using
## the `.groups` argument.
cat("Top 10 Track Artists within each Playlist Genre:\n")
## Top 10 Track Artists within each Playlist Genre:
print(artist_genre)
## # A tibble: 61 × 3
## # Groups:   playlist_genre [6]
##    playlist_genre track_artist                  n
##    <chr>          <chr>                     <int>
##  1 edm            Armin van Buuren             38
##  2 edm            Bassjackers                  38
##  3 edm            Blasterjaxx                  38
##  4 edm            Calvin Harris                40
##  5 edm            David Guetta                 60
##  6 edm            Dimitri Vegas & Like Mike    79
##  7 edm            Hardwell                     76
##  8 edm            Martin Garrix               125
##  9 edm            R3HAB                        38
## 10 edm            The Chainsmokers             49
## # ℹ 51 more rows
# Bar plot of the 10 most popular distinct songs, coloured by artist.
# A track can appear in several playlists, so repeated rows are dropped first;
# otherwise geom_bar(stat = "identity") would stack them into one inflated bar.
top_10_popular_songs <- Spotify %>%
  distinct(track_name, track_artist, track_popularity) %>%
  arrange(desc(track_popularity)) %>%
  head(10)

ggplot(top_10_popular_songs, aes(x = track_name, y = track_popularity, 
                                 fill = track_artist)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Popular Songs",
       x = "Song",
       y = "Popularity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top artists by genre

The top artists list above features many edm artists. This may be due to the high popularity of edm songs. So what about artists who create songs in other genres? We will try to find the top artists in each genre, using a treemap for the analysis.

library(ggplot2)
library(dplyr)
library(treemap)

# Reload the raw dataset (this discards the popularity column added earlier)
Spotify <- read.csv("spotify_songs.csv")

# Create a summary of top artists within each playlist genre
artist_genre <- Spotify %>%
  dplyr::select(playlist_genre, track_artist, track_popularity) %>%
  group_by(playlist_genre, track_artist) %>%
  summarise(n = n()) %>%
  top_n(10, n)
## `summarise()` has grouped output by 'playlist_genre'. You can override using
## the `.groups` argument.
# Create a treemap visualization
tm <- treemap(artist_genre, index = c("playlist_genre", "track_artist"), vSize = "n", vColor = 'playlist_genre', palette = viridisLite::viridis(6), title = "Top 10 Track Artists within each Playlist Genre")

# treemap() has already drawn the plot; printing the returned list shows the underlying rectangle data
print(tm) 
## $tm
##    playlist_genre                 track_artist vSize vColor stdErr vColorValue
## 1             edm             Armin van Buuren    38      1     38          NA
## 2             edm                  Bassjackers    38      1     38          NA
## 3             edm                  Blasterjaxx    38      1     38          NA
## 4             edm                Calvin Harris    40      1     40          NA
## 5             edm                 David Guetta    60      1     60          NA
## 6             edm    Dimitri Vegas & Like Mike    79      1     79          NA
## 7             edm                     Hardwell    76      1     76          NA
## 8             edm                Martin Garrix   125      1    125          NA
## 9             edm                         <NA>   630     11    630          NA
## 10            edm                        R3HAB    38      1     38          NA
## 11            edm             The Chainsmokers    49      1     49          NA
## 12            edm                       Tiësto    49      1     49          NA
## 13          latin                    Bad Bunny    32      1     32          NA
## 14          latin         Ballin Entertainment    42      1     42          NA
## 15          latin                 Daddy Yankee    61      1     61          NA
## 16          latin                     Don Omar   100      1    100          NA
## 17          latin                      Farruko    30      1     30          NA
## 18          latin               Gloria Estefan    43      1     43          NA
## 19          latin                     J Balvin    56      1     56          NA
## 20          latin                         <NA>   516     10    516          NA
## 21          latin                    Nicky Jam    45      1     45          NA
## 22          latin                        Ozuna    47      1     47          NA
## 23          latin               Wisin & Yandel    60      1     60          NA
## 24            pop                Ariana Grande    39      1     39          NA
## 25            pop                       Avicii    40      1     40          NA
## 26            pop                Calvin Harris    42      1     42          NA
## 27            pop                 David Guetta    44      1     44          NA
## 28            pop                 Javiera Mena    43      1     43          NA
## 29            pop                   Katy Perry    34      1     34          NA
## 30            pop                         Kygo    41      1     41          NA
## 31            pop                     Maroon 5    33      1     33          NA
## 32            pop                Martin Garrix    30      1     30          NA
## 33            pop                         <NA>   392     10    392          NA
## 34            pop             The Chainsmokers    46      1     46          NA
## 35            r&b               Anderson .Paak    30      1     30          NA
## 36            r&b                  Bobby Brown    42      1     42          NA
## 37            r&b                     D'Angelo    28      1     28          NA
## 38            r&b                        Drake    33      1     33          NA
## 39            r&b                  Erykah Badu    32      1     32          NA
## 40            r&b                  Frank Ocean    36      1     36          NA
## 41            r&b                          Guy    30      1     30          NA
## 42            r&b                Janet Jackson    39      1     39          NA
## 43            r&b                  John Legend    28      1     28          NA
## 44            r&b                         <NA>   339     10    339          NA
## 45            r&b                   The Weeknd    41      1     41          NA
## 46            rap                         2Pac    55      1     55          NA
## 47            rap                      50 Cent    53      1     53          NA
## 48            rap                        Drake    36      1     36          NA
## 49            rap                       Eminem    39      1     39          NA
## 50            rap                       Future    33      1     33          NA
## 51            rap                        Logic    65      1     65          NA
## 52            rap                         <NA>   451     10    451          NA
## 53            rap                      OutKast    34      1     34          NA
## 54            rap                    Rick Ross    44      1     44          NA
## 55            rap                     The Game    46      1     46          NA
## 56            rap         The Notorious B.I.G.    46      1     46          NA
## 57           rock                    Aerosmith    36      1     36          NA
## 58           rock Creedence Clearwater Revival    38      1     38          NA
## 59           rock                Guns N' Roses    76      1     76          NA
## 60           rock                         <NA>   532     10    532          NA
## 61           rock                        Queen   134      1    134          NA
## 62           rock                    Scorpions    44      1     44          NA
## 63           rock              The Cranberries    45      1     45          NA
## 64           rock           The Rolling Stones    36      1     36          NA
## 65           rock                      The Who    40      1     40          NA
## 66           rock                    Van Halen    42      1     42          NA
## 67           rock               オメガトライブ    41      1     41          NA
##    level        x0         y0          w          h   color
## 1      2 0.1805750 0.57555938 0.11285936 0.11772806 #30004C
## 2      2 0.1805750 0.45783133 0.11285936 0.11772806 #33004C
## 3      2 0.2934343 0.57555938 0.11285936 0.11772806 #35004C
## 4      2 0.3048471 0.69328744 0.10144662 0.13786575 #38004C
## 5      2 0.1805750 0.83115318 0.12424884 0.16884682 #3B004C
## 6      2 0.0000000 0.60499139 0.18057498 0.15296902 #3D004C
## 7      2 0.0000000 0.45783133 0.18057498 0.14716007 #40004C
## 8      2 0.0000000 0.75796041 0.18057498 0.24203959 #42004C
## 9      1 0.0000000 0.45783133 0.40629371 0.54216867 #440154
## 10     2 0.2934343 0.45783133 0.11285936 0.11772806 #45004C
## 11     2 0.3048238 0.83115318 0.10146989 0.16884682 #47004C
## 12     2 0.1805750 0.69328744 0.12427211 0.13786575 #4A004C
## 13     2 0.6001780 0.43050648 0.06344403 0.17635721 #2E3E7A
## 14     2 0.5169077 0.43050648 0.08327028 0.17635721 #2E3B7A
## 15     2 0.5496453 0.75608901 0.08744449 0.24391099 #2E387A
## 16     2 0.4062937 0.75608901 0.14335162 0.24391099 #2E357A
## 17     2 0.6636220 0.43050648 0.05947877 0.17635721 #2E337A
## 18     2 0.6223473 0.60686369 0.10075344 0.14922533 #2E307A
## 19     2 0.4062937 0.57907327 0.11061400 0.17701575 #2F2E7A
## 20     1 0.4062937 0.43050648 0.31680708 0.56949352 #414487
## 21     2 0.5169077 0.60686369 0.10543964 0.14922533 #322E7A
## 22     2 0.4062937 0.43050648 0.11061400 0.14856679 #342E7A
## 23     2 0.6370898 0.75608901 0.08601097 0.24391099 #372E7A
## 24     2 0.5143141 0.00000000 0.09442097 0.14442092 #147A80
## 25     2 0.5143141 0.14442092 0.09442097 0.14812402 #147680
## 26     2 0.5143141 0.29254494 0.10644499 0.13796154 #147280
## 27     2 0.4062937 0.13918630 0.10802043 0.14242320 #146E80
## 28     2 0.4062937 0.00000000 0.10802043 0.13918630 #146A80
## 29     2 0.6087351 0.19000342 0.11593461 0.10254153 #146680
## 30     2 0.6207591 0.29254494 0.10391059 0.13796154 #146280
## 31     2 0.6087351 0.09047782 0.11593461 0.09952560 #145E80
## 32     2 0.6087351 0.00000000 0.11593461 0.09047782 #145A80
## 33     1 0.4062937 0.00000000 0.31837602 0.43050648 #2A788E
## 34     2 0.4062937 0.28160950 0.10802043 0.14889698 #145680
## 35     2 0.8122171 0.00000000 0.07866579 0.13334271 #069758
## 36     2 0.7246697 0.27557494 0.09478583 0.15493153 #06975E
## 37     2 0.8908829 0.08972207 0.10911707 0.08972207 #069763
## 38     2 0.7246697 0.00000000 0.08754742 0.13179671 #069768
## 39     2 0.8122171 0.13334271 0.07866579 0.14223223 #06976E
## 40     2 0.7246697 0.13179671 0.08754742 0.14377823 #069773
## 41     2 0.8908829 0.17944415 0.10911707 0.09613079 #069778
## 42     2 0.9119846 0.27557494 0.08801542 0.15493153 #06977E
## 43     2 0.8908829 0.00000000 0.10911707 0.08972207 #069783
## 44     1 0.7246697 0.00000000 0.27533028 0.43050648 #22A884
## 45     2 0.8194556 0.27557494 0.09252903 0.15493153 #069788
## 46     2 0.8271381 0.78154683 0.08803154 0.21845317 #75BC32
## 47     2 0.9151696 0.78154683 0.08483039 0.21845317 #70BC32
## 48     2 0.8147365 0.52492609 0.12590723 0.09997370 #6BBC32
## 49     2 0.9129485 0.62489979 0.08705152 0.15664704 #66BC32
## 50     2 0.9406437 0.43050648 0.05935627 0.19439331 #61BC32
## 51     2 0.7231008 0.78154683 0.10403728 0.21845317 #5CBC32
## 52     1 0.7231008 0.43050648 0.27689921 0.56949352 #7AD151
## 53     2 0.8147365 0.43050648 0.12590723 0.09441961 #56BC32
## 54     2 0.8147365 0.62489979 0.09821198 0.15664704 #51BC32
## 55     2 0.7231008 0.60602665 0.09163571 0.17552018 #4CBC32
## 56     2 0.7231008 0.43050648 0.09163571 0.17552018 #47BC32
## 57     2 0.2678261 0.00000000 0.06923378 0.18181027 #E4A700
## 58     2 0.1947460 0.00000000 0.07308010 0.18181027 #E4AF00
## 59     2 0.0000000 0.00000000 0.12231983 0.21724545 #E4B700
## 60     1 0.0000000 0.00000000 0.40629371 0.45783133 #FDE725
## 61     2 0.0000000 0.21724545 0.19474604 0.24058587 #E4C000
## 62     2 0.1947460 0.31568875 0.10823369 0.14214258 #E4C800
## 63     2 0.1223198 0.00000000 0.07242621 0.21724545 #E4D100
## 64     2 0.3370599 0.00000000 0.06923378 0.18181027 #E4D900
## 65     2 0.3018257 0.18181027 0.10446798 0.13387847 #E4E200
## 66     2 0.3029797 0.31568875 0.10331397 0.14214258 #DDE400
## 67     2 0.1947460 0.18181027 0.10707968 0.13387847 #D5E400
## 
## $type
## [1] "index"
## 
## $vSize
## [1] "n"
## 
## $vColor
## [1] NA
## 
## $stdErr
## [1] "n"
## 
## $algorithm
## [1] "pivotSize"
## 
## $vpCoorX
## [1] 0.02812148 0.97187852
## 
## $vpCoorY
## [1] 0.01968504 0.91031496
## 
## $aspRatio
## [1] 1.483512
## 
## $range
## [1] NA
## 
## $mapping
## [1] NA NA NA
## 
## $draw
## [1] TRUE

The treemap above depicts the top 10 track artists within each playlist genre. The size of each box corresponds to the number of tracks by that artist. For the genres edm, rock, pop, rap, latin, and r&b, the top track artists are Martin Garrix, Queen, The Chainsmokers, Logic, Don Omar, and Bobby Brown, respectively.
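
The single top artist per genre can also be read off programmatically rather than from the plot. The sketch below uses invented data (`toy` and its artist names are hypothetical) and dplyr’s `slice_max()`; `with_ties = FALSE` is a choice made here so that exactly one artist is returned per genre.

```r
library(dplyr)

# Toy data: one row per track (artist names are invented).
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "rock", "rock", "rock"),
  track_artist   = c("Artist A", "Artist A", "Artist B",
                     "Artist C", "Artist D", "Artist D"),
  stringsAsFactors = FALSE
)

top_artist <- toy %>%
  count(playlist_genre, track_artist, name = "n") %>%  # tracks per artist and genre
  group_by(playlist_genre) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%           # keep the largest count
  ungroup()

print(top_artist)
```

With ties allowed, the default, a genre can return more than one artist, which would explain a "top 10" summary containing more than 10 rows for a genre.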

One of Spotify’s most popular features is Discover Weekly, a playlist generated each week from a user’s listening habits. As a Spotify user, I have found these playlists extremely accurate and useful. I wanted to try building a basic version of it: a song recommendation engine based on the following attributes:

Based on Genre: Songs will be displayed as per the user's preferred genre and rating scale.

Based on Artists: Songs will be filtered as per the artist preference of the user and the rating scale.

Based on Mood: Songs will be filtered as per the mood preference and rating scale specified by the user. For this purpose, songs have been classified into different groups: Gym (songs with high energy), Cheerful (songs with high valence), Party/Dance (songs with high danceability) and Others.
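The mood grouping described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper name classify_mood, the cut-off of 0.75 for "high", and the priority order (Gym first, then Cheerful, then Party/Dance) are all hypothetical choices, and the toy rows stand in for the real Spotify columns.

```r
# Hypothetical helper: a feature counts as "high" at or above `cutoff` (assumed 0.75).
# The first matching rule wins, so the order Gym > Cheerful > Party/Dance is an assumption.
classify_mood <- function(energy, valence, danceability, cutoff = 0.75) {
  ifelse(energy >= cutoff, "Gym",
    ifelse(valence >= cutoff, "Cheerful",
      ifelse(danceability >= cutoff, "Party/Dance", "Others")))
}

# Toy rows standing in for Spotify$energy, Spotify$valence and Spotify$danceability
songs <- data.frame(
  energy       = c(0.92, 0.40, 0.55, 0.30),
  valence      = c(0.50, 0.81, 0.60, 0.20),
  danceability = c(0.70, 0.65, 0.88, 0.45)
)

songs$mood <- classify_mood(songs$energy, songs$valence, songs$danceability)
print(songs)
```

On the real data the same helper could be called with the corresponding columns of the Spotify data frame, after choosing a cut-off that suits the distribution of each feature.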

# Select the variables for correlation
variables <- c('danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms')

# Compute the correlation matrix
correlation_matrix <- cor(Spotify[, variables])

# Create a heatmap
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
melted_correlation <- melt(correlation_matrix)
ggplot(data = melted_correlation, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "darkblue", high = "pink") +
  theme_minimal() +
  labs(title = "Correlation Heatmap") +
  geom_text(aes(label = round(value, 2)), color = "white", size = 3) + coord_flip()

library(ggplot2)
library(scales)

# Assuming a data frame called 'Spotify' with columns: loudness and energy

ggplot(Spotify, aes(x = loudness, y = energy)) +
  geom_point(color = "#FF6F00") +  # Orange points
  geom_smooth(method = "lm", color = "#FFAB00", se = FALSE) +  # Linear fit in a lighter shade of orange
  labs(title = "Correlation between Loudness and Energy") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# alpha is a fixed setting, so it goes outside aes()
ggplot(data = Spotify) + 
      geom_point(mapping = aes(x = duration_ms, y = track_popularity,
                               color = playlist_genre), alpha = 0.12)

library(ggplot2)
library(scales)

# Assuming a data frame called 'Spotify' with columns: track_popularity and acousticness

ggplot(Spotify, aes(x = track_popularity, y = acousticness)) +
  geom_point(color = "#1F77B4") +  # Blue points
  geom_smooth(method = "lm", color = "#FF7F0E", se = FALSE) +  # Linear fit in orange
  labs(title = "Correlation between Popularity and Acousticness") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Spotify$track_album_release_year <- as.numeric(format(as.Date(Spotify$track_album_release_date, "%m/%d/%y"), "%Y"))

# Calculate the average duration for each year
average_duration_by_year <- aggregate(duration_ms ~ track_album_release_year, Spotify, mean)

# Plot the change in duration over years
library(ggplot2)
ggplot(data = average_duration_by_year, aes(x = track_album_release_year, y = duration_ms)) +
  geom_line() +
  labs(x = "Year", y = "Average Duration (ms)", title = "Change in Duration of Songs over Years")
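The release year above is parsed with the two-digit %y format. R's date parsing maps the input values 00 to 68 to 20xx and 69 to 99 to 19xx, so an album released in 1965 would be read as 2065. The following is a minimal sketch of one possible guard; the helper name parse_release_year and the cut-off max_year = 2020 are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: parse "m/d/yy" dates and pull years that land in the
# future back by one century. max_year is an assumed cut-off for this dataset.
parse_release_year <- function(x, max_year = 2020) {
  yr <- as.numeric(format(as.Date(x, format = "%m/%d/%y"), "%Y"))
  ifelse(!is.na(yr) & yr > max_year, yr - 100, yr)
}

years <- parse_release_year(c("6/14/19", "3/1/65", "12/13/99"))
print(years)  # 2019 1965 1999
```

Whether such a correction is needed depends on whether the dataset contains releases from before 1969 stored with two-digit years.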

# Calculate the average duration for each genre
average_duration_by_genre <- aggregate(duration_ms ~ playlist_genre, Spotify, mean)

# Display the average duration for each genre
print(average_duration_by_genre)
##   playlist_genre duration_ms
## 1            edm    222540.9
## 2          latin    216863.4
## 3            pop    217768.1
## 4            r&b    237599.5
## 5            rap    214163.9
## 6           rock    248576.5
most_songs <- Spotify %>%
  group_by(track_artist) %>%
  summarize(total_songs = n_distinct(track_name)) %>%
  arrange(desc(total_songs)) %>%
  slice(1:15) %>%
  ggplot(aes(x = track_artist, y = total_songs, color = track_artist)) +
  geom_segment(aes(x = track_artist, xend = track_artist, y = 0, yend = total_songs)) +
  geom_point(size = 2, color = "maroon") +
  scale_color_viridis(discrete = TRUE, guide = "none", option = "E") +
  theme_light(base_size = 12, base_family = "HiraKakuProN-W3") +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  labs(title = "Top 15 Artists with Most Songs",
       x = "Artist",
       y = "Total Songs") +
  coord_flip()

most_songs

# Note: this sums the durations of each artist's tracks in the dataset.
# It is only a proxy for listening time, since the dataset has no play counts.
most_listened <- Spotify %>%
  mutate(track_artist = iconv(track_artist, to = "UTF-8")) %>%
  group_by(track_artist) %>%
  summarize(listenedHours = sum(duration_ms) / 1000 / 3600) %>%
  arrange(desc(listenedHours)) %>%
  slice(1:15) %>%
  ggplot(aes(x = track_artist, y = listenedHours, color = track_artist)) +
  geom_segment(aes(x = track_artist, xend = track_artist, y = 0, yend = listenedHours)) +
  geom_point(size = 2, color = "cyan3") +
  scale_color_viridis(discrete = TRUE, guide = "none", option = "C") +
  theme_light(base_size = 12, base_family = "HiraKakuProN-W3") +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  labs(title = "Top 15 most listened artists") +
  xlab("") +
  ylab("Hours") +
  coord_flip()

most_listened

# Count the number of songs for each genre
genre_counts <- table(Spotify$playlist_genre)

# Create a bar graph of the genre counts
barplot(genre_counts, main = "Number of Songs by Genre", xlab = "Genre", ylab = "Count")

# Calculate the average popularity by genre
avg_popularity <- Spotify %>%
  group_by(playlist_genre) %>%
  summarise(avg_popularity = mean(track_popularity))

# Plot the average popularity by genre
ggplot(avg_popularity, aes(x = playlist_genre, y = avg_popularity)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Genre", y = "Average Popularity", title = "Average Popularity by Genre")

# Create the scatter plot
ggplot(Spotify, aes(x = valence, y = energy, color = track_name)) +
  geom_jitter(show.legend = FALSE) +
  scale_color_viridis(discrete = TRUE, option = "D") +
  geom_vline(xintercept = 0.5) +
  geom_hline(yintercept = 0.5) +
  scale_x_continuous(breaks = seq(0, 1, 0.25)) +
  scale_y_continuous(breaks = seq(0, 1, 0.25)) +
  labs(title = "How positive is your music?") +
  theme_light()
## Warning: Removed 5 rows containing missing values (`geom_point()`).

library(ggplot2)

# Assuming the data frame 'Spotify' has columns: track_name, danceability, and energy

track_names <- Spotify$track_name
danceability <- Spotify$danceability
energy <- Spotify$energy

spotify_data <- data.frame(
  track_name = track_names,
  danceability = danceability,
  energy = energy
)

spotify_data %>%
  ggplot(aes(x = danceability, y = energy, color = track_name)) +
  geom_jitter(show.legend = FALSE) +
  scale_color_viridis(discrete = TRUE, option = "C") +
  labs(title = "Workout vibes") +
  theme_light()
## Warning: Removed 5 rows containing missing values (`geom_point()`).

# Read the dataset
Spotify <- read.csv("spotify_songs.csv")

# Create a scatter plot of popularity versus danceability
ggplot(Spotify, aes(x = track_popularity, y = danceability)) +
  geom_point() +
  labs(x = "Popularity", y = "Danceability") +
  ggtitle("Popularity versus Danceability")

# Load required packages
library(ggplot2)
library(dplyr)

# Load the dataset
Spotify <- read.csv("spotify_songs.csv")  # Replace "spotify_songs.csv" with the actual file name and path

# Explore the dataset
head(Spotify)  # Check the structure and contents of the dataset
##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1                  6/14/19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2                 12/13/19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3                   7/5/19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4                  7/19/19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5                   3/5/19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6                  7/11/19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049
# Scatter plot: Popularity vs Duration
ggplot(Spotify, aes(x = track_popularity, y = duration_ms)) +
  geom_point() +
  labs(x = "Track Popularity", y = "Duration (ms)") +
  ggtitle("Popularity vs Duration")

# Scatter plot: Loudness vs Danceability
ggplot(Spotify, aes(x = loudness, y = danceability)) +
  geom_point() +
  labs(x = "Loudness", y = "Danceability") +
  ggtitle("Loudness vs Danceability")

# Read the dataset
Spotify <- read.csv("spotify_songs.csv")

# Create a scatter plot of popularity versus loudness
ggplot(Spotify, aes(x = track_popularity, y = loudness)) +
  geom_point() +
  labs(x = "Popularity", y = "Loudness") +
  ggtitle("Popularity versus Loudness")

# Read the dataset
Spotify <- read.csv("spotify_songs.csv")

# Create a scatter plot of popularity versus tempo
ggplot(Spotify, aes(x = track_popularity, y = tempo)) +
  geom_point() +
  labs(x = "Popularity", y = "Tempo") +
  ggtitle("Popularity versus Tempo")

Although the structure of each song is unique in some way, songs share some common threads. Let us check the correlations between the various attributes of a song.

# Extract relevant columns (attributes)
attributes <- Spotify[c("acousticness", "loudness", "valence", "danceability", "liveness", "energy", "instrumentalness","key","tempo","duration_ms","speechiness")]

# Create a correlation matrix
att_cor <- cor(attributes)

# Plot the correlation matrix using ggcorrplot
ggcorrplot(att_cor, type = "lower", hc.order = TRUE, colors = c("orange", "lightyellow", "lightblue"))

From the correlation plot, we can observe that:

There is a high positive correlation between energy and loudness.

There is a high negative correlation between energy and acousticness.

There are moderate correlations between loudness and acousticness, and between valence and danceability.

Our analysis also suggests that speechiness, tempo and key have no strong correlation with track popularity. Note that track_popularity is not among the columns of the matrix plotted above, so this has to be checked against track_popularity separately. We therefore take popularity to be influenced by the following characteristics:

acousticness, loudness, valence, danceability, liveness, energy and instrumentalness.

These observations will be helpful when we build a predictive model.
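Statements about which attributes relate to popularity are easiest to check by correlating each attribute with track_popularity directly. The following is a minimal sketch of that check. It runs on a small synthetic data frame (the values are made up and the name toy is hypothetical), so the numbers it prints say nothing about the real Spotify data.

```r
set.seed(1)

# Synthetic stand-in for the Spotify data frame; all values are made up
n <- 200
toy <- data.frame(
  energy = runif(n),
  tempo  = runif(n, 60, 180)
)
# Made-up popularity that falls as energy rises, clipped to the 0-100 range
toy$track_popularity <- round(pmin(100, pmax(0, 50 - 20 * toy$energy + rnorm(n, sd = 10))))

# Correlation of each attribute with popularity
features <- c("energy", "tempo")
pop_cor <- sapply(toy[features], function(x) cor(x, toy$track_popularity))
print(sort(pop_cor))
```

With the real data, the same sapply() call could be applied to the attribute columns of Spotify, with Spotify$track_popularity as the second argument to cor().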

In this section, we try to build a model that can predict the popularity of a song from its other attributes. More specifically, the model predicts which popularity class (low, medium or high) a song falls into, based on its other attributes.

Multinomial Logistic Regression (nnet)

We can use a multinomial logistic regression, as the response variable has three popularity classes. During our exploratory data analysis, we saw from the correlation plot that track popularity is related to acousticness, loudness, valence, danceability, liveness, energy and instrumentalness, so it is a good idea to fit the popularity class on these attributes. The code below keeps all numeric audio features except mode, so key, speechiness, tempo and duration_ms enter the model as well. The first step is to randomly split the whole dataset into a training set (70%) and a testing set (30%) for model validation. We train the model on the training set and then test its predictive capability on the testing set.

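# Bin track_popularity into three classes
Spotify <- Spotify %>% mutate(popularity = case_when(track_popularity <= 30 ~ "low",
                                track_popularity > 30 & track_popularity <= 75  ~ "medium",
                                track_popularity > 75 ~ "high"))
# Keep the audio features (columns 12-15 and 17-23, skipping mode) and the popularity class (column 24)
spotify_train <- Spotify[c(12:15,17,18:21,22:24)]
set.seed(123)
# 70% of the rows go to the training set, the rest to the testing set
train_idx <- sample(nrow(spotify_train), .70*nrow(spotify_train))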

train <- spotify_train[train_idx,]
test <- spotify_train[-train_idx,]

Now let us fit and analyse the model. When we build a multinomial logistic model, we need to set one of the levels of the dependent variable as the baseline. We do this with the relevel() function.

# Setting the baseline 
train$popularity <- relevel(factor(train$popularity), ref = "low")
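As a small illustration of what relevel() does, the sketch below uses a made-up factor with the same three class names. By default the levels are sorted alphabetically, so "high" would be the baseline; after relevel() the reference level "low" comes first, which is also why the probability table later lists the columns in the order low, high, medium.

```r
# Made-up labels with the same three class names as the popularity variable
popularity <- factor(c("low", "medium", "high", "medium"))
print(levels(popularity))   # "high" "low" "medium" (alphabetical default)

# Move "low" to the front so that it becomes the baseline level
releveled <- relevel(popularity, ref = "low")
print(levels(releveled))    # "low" "high" "medium"
```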

Once the baseline has been specified, we use multinom() function to fit the model and then use summary() function to explore the beta coefficients of the model.

# Fit multinomial logistic regression model using nnet
library(nnet)  # provides multinom()
nnet_model <- multinom(popularity ~ ., data = train, MaxNWts = 10000)
## # weights:  39 (24 variable)
## initial  value 25249.406230 
## iter  10 value 19934.785195
## iter  20 value 19317.181999
## iter  30 value 18967.065686
## final  value 18966.931980 
## converged
# View the summary of the model
summary(nnet_model)
## Call:
## multinom(formula = popularity ~ ., data = train, MaxNWts = 10000)
## 
## Coefficients:
##        (Intercept) danceability    energy         key   loudness speechiness
## high      4.809515   1.04371496 -5.455723 0.013874612 0.33368610  -0.4953364
## medium    2.776981   0.09018309 -1.366784 0.003423022 0.05690381  -0.5774267
##        acousticness instrumentalness   liveness   valence       tempo
## high      0.3634678       -3.4531041 -0.7616111 0.3151996 0.002773656
## medium    0.4702569       -0.5508969 -0.2649272 0.1121809 0.001109072
##          duration_ms
## high   -5.484978e-06
## medium -3.892645e-06
## 
## Std. Errors:
##         (Intercept) danceability       energy         key     loudness
## high   3.189888e-07 2.019264e-07 2.462579e-07 1.56557e-06 1.393468e-06
## medium 1.469817e-06 8.936189e-07 1.187271e-06 7.35239e-06 6.289533e-06
##         speechiness acousticness instrumentalness     liveness      valence
## high   5.950934e-08 4.809681e-08     5.547265e-09 5.777233e-08 1.662807e-07
## medium 2.305388e-07 1.973743e-07     8.739571e-08 3.021510e-07 7.455947e-07
##               tempo  duration_ms
## high   8.147455e-05 1.320738e-07
## medium 3.183608e-04 1.687891e-07
## 
## Residual Deviance: 37933.86 
## AIC: 37981.86

The summary output contains a table of coefficients and a table of standard errors. Each row of the coefficient table corresponds to one model equation: the log of the ratio between the probability of that popularity class and the probability of the baseline class, “low”. This ratio is referred to as the relative risk (often loosely described as odds). Since the model outputs the log of this ratio, we need to exponentiate the coefficients to get the relative risk ratios.

# Extracting coefficients and exponentiating
nnet_coefficients <- coef(nnet_model)
nnet_odds_ratios <- exp(nnet_coefficients)

# Print the exponentiated coefficients
print(nnet_odds_ratios)
##        (Intercept) danceability      energy      key loudness speechiness
## high     122.67216     2.839747 0.004271785 1.013971 1.396105   0.6093659
## medium    16.07042     1.094375 0.254925360 1.003429 1.058554   0.5613410
##        acousticness instrumentalness  liveness  valence    tempo duration_ms
## high       1.438309       0.03164725 0.4669136 1.370533 1.002778   0.9999945
## medium     1.600405       0.57643258 0.7672618 1.118715 1.001110   0.9999961

The output above shows the relative risk ratio for a one-unit increase in each variable, for being in the high or medium popularity class versus the low popularity class. A value of 1 means no change, a value greater than 1 means an increase and a value less than 1 means a decrease. We can also use probabilities to understand our model.
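To see how log relative risks turn into class probabilities, here is a minimal sketch for a single song. The three values in eta are hypothetical linear predictors chosen for illustration; they are not taken from the fitted model. The baseline class "low" is fixed at 0, and the probabilities follow by exponentiating and normalising.

```r
# Hypothetical linear predictors for one song: the log relative risk of each
# class versus the "low" baseline (the baseline itself is fixed at 0)
eta <- c(low = 0, high = -1.2, medium = 0.85)

# Exponentiate and normalise so that the three probabilities sum to 1
probs <- exp(eta) / sum(exp(eta))
print(round(probs, 3))

# The predicted class is the one with the largest probability
predicted_class <- names(which.max(probs))
print(predicted_class)
```

This is the same calculation that predict() performs with type = "probs" and type = "class", applied to every row.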

# Assuming nnet_model is your fitted multinomial logistic regression model using nnet
predicted_probs <- predict(nnet_model, type = "probs", newdata = train)

# Display the head of the probability table
head(predicted_probs)
##             low       high    medium
## 2986  0.2728531 0.08295117 0.6441958
## 29925 0.3920592 0.01091166 0.5970291
## 29710 0.2770265 0.07331618 0.6496573
## 2757  0.4086572 0.01701446 0.5743283
## 9642  0.1973675 0.25185298 0.5507795
## 31313 0.4418558 0.01422519 0.5439190

The table above indicates that observation 2986 has a probability of 64.42% of being in the medium popularity class, 27.29% of being in the low popularity class and 8.30% of being in the high popularity class. Thus we predict that observation 2986 is of medium popularity. Similarly, observations 29925 and 29710 are predicted to be of medium popularity, and so on. We now check the model accuracy by building a classification table, starting with the training dataset.

# Assuming nnet_model is your fitted multinomial logistic regression model using nnet
train$predicted <- predict(nnet_model, newdata = train, type = "class")

# Building the classification table
ctable <- table(train$popularity, train$predicted)

# Calculating accuracy - sum of diagonal elements divided by total observations
accuracy <- sum(diag(ctable)) / sum(ctable)

# Print accuracy (percentage)
cat("Accuracy:", round(accuracy * 100, 2), "%\n")
## Accuracy: 62.19 %

The accuracy on the training dataset is 62.19%. We now repeat the above on the testing dataset.

# Assuming nnet_model is your fitted multinomial logistic regression model using nnet
test$predicted <- predict(nnet_model, newdata = test, type = "class")

# Building the classification table
ctable <- table(test$popularity, test$predicted)

# Calculating accuracy - sum of diagonal elements divided by total observations
accuracy <- sum(diag(ctable)) / sum(ctable)

# Print accuracy (percentage)
cat("Accuracy:", round(accuracy * 100, 2), "%\n")
## Accuracy: 59.36 %

The model predicts the popularity class of songs in the testing dataset with an accuracy of 59.36%.
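An accuracy figure is easier to judge next to a simple baseline. The sketch below compares overall accuracy with the accuracy of always predicting the most frequent class, and also reports the share of each class that is predicted correctly. The labels are made up for illustration, so the numbers do not describe the model above.

```r
# Made-up true and predicted classes, for illustration only
lv <- c("low", "high", "medium")
actual    <- factor(c("medium", "medium", "low", "medium", "high", "low", "medium", "medium"),
                    levels = lv)
predicted <- factor(c("medium", "medium", "medium", "medium", "medium", "low", "medium", "low"),
                    levels = lv)

# Classification table: rows are true classes, columns are predicted classes
ctable <- table(actual, predicted)

accuracy <- sum(diag(ctable)) / sum(ctable)
baseline <- max(table(actual)) / length(actual)  # always predict the majority class
recall   <- diag(ctable) / rowSums(ctable)       # share of each true class predicted correctly

print(ctable)
cat("Accuracy:", accuracy, " Baseline:", baseline, "\n")
print(recall)
```

In this toy case the accuracy equals the majority-class baseline and no "high" song is recognised, which is the kind of pattern that overall accuracy alone can hide.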

k-NN

model_data <- Spotify %>%
  mutate(popularity_gp = case_when(
     track_popularity >= 0 & track_popularity <= 31 ~ "Least_Popularity",
                  track_popularity >= 32 & track_popularity <= 52 ~ "Average_Popularity",
                  TRUE ~ "Highest_Popularity"
  )) %>%
  select(where(is.numeric), -c( playlist_genre, track_popularity,duration_ms), popularity_gp)

model_data$popularity_gp = as.factor(model_data$popularity_gp)
str(model_data)
## 'data.frame':    32833 obs. of  12 variables:
##  $ danceability    : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy          : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key             : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness        : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode            : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness     : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness    : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness: num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness        : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence         : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo           : num  122 100 124 122 124 ...
##  $ popularity_gp   : Factor w/ 3 levels "Average_Popularity",..: 2 2 2 2 2 2 2 2 2 2 ...
table(model_data$popularity_gp)
## 
## Average_Popularity Highest_Popularity   Least_Popularity 
##               9560              12942              10331
set.seed(3245)
gp <- runif(nrow(model_data))
model_data <- model_data[order(gp),]
head(model_data,5)
##       danceability energy key loudness mode speechiness acousticness
## 1031         0.703  0.885   9   -6.712    1      0.0322      0.02750
## 12523        0.581  0.854   1   -8.485    0      0.0428      0.00197
## 22966        0.598  0.526  10   -8.659    0      0.0415      0.12900
## 12458        0.608  0.768   1   -9.911    1      0.0364      0.10100
## 13070        0.730  0.785   2   -7.201    1      0.0456      0.04180
##       instrumentalness liveness valence   tempo      popularity_gp
## 1031          2.84e-03   0.2550   0.939 123.997 Average_Popularity
## 12523         1.30e-03   0.1110   0.788 131.180 Highest_Popularity
## 22966         0.00e+00   0.1400   0.529 123.935 Highest_Popularity
## 12458         1.41e-06   0.0942   0.748 132.699 Highest_Popularity
## 13070         6.69e-03   0.1230   0.724 137.639 Average_Popularity
# Note: column 11 is tempo, so this drops tempo and keeps popularity_gp (column 12)
summary(model_data[,-11])
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence                  popularity_gp  
##  Min.   :0.0000   Min.   :0.0000   Average_Popularity: 9560  
##  1st Qu.:0.0927   1st Qu.:0.3310   Highest_Popularity:12942  
##  Median :0.1270   Median :0.5120   Least_Popularity  :10331  
##  Mean   :0.1902   Mean   :0.5106                             
##  3rd Qu.:0.2480   3rd Qu.:0.6930                             
##  Max.   :0.9960   Max.   :0.9910
normalize <- function(x) {
  return((x - min(x))/(max(x) - min(x)))
}

model_norm <- model_data
model_norm$popularity_gp <- NULL
model_norm <- as.data.frame(lapply(model_norm,normalize))
summary(model_norm)
##   danceability        energy            key            loudness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.5727   1st Qu.:0.5809   1st Qu.:0.1818   1st Qu.:0.8021  
##  Median :0.6836   Median :0.7210   Median :0.5455   Median :0.8441  
##  Mean   :0.6662   Mean   :0.6986   Mean   :0.4886   Mean   :0.8325  
##  3rd Qu.:0.7742   3rd Qu.:0.8400   3rd Qu.:0.8182   3rd Qu.:0.8760  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       mode         speechiness       acousticness     instrumentalness   
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.04466   1st Qu.:0.01519   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.06808   Median :0.08089   Median :0.0000162  
##  Mean   :0.5657   Mean   :0.11663   Mean   :0.17639   Mean   :0.0852587  
##  3rd Qu.:1.0000   3rd Qu.:0.14379   3rd Qu.:0.25654   3rd Qu.:0.0048592  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000000  
##     liveness          valence           tempo       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.09307   1st Qu.:0.3340   1st Qu.:0.4175  
##  Median :0.12751   Median :0.5166   Median :0.5095  
##  Mean   :0.19094   Mean   :0.5152   Mean   :0.5048  
##  3rd Qu.:0.24900   3rd Qu.:0.6993   3rd Qu.:0.5593  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000
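set.seed(123)
train_idx <- sample(nrow(model_norm), .80*nrow(model_norm))

model_train <- model_norm[train_idx,]
model_test <- model_norm[-train_idx,]
# The target is popularity_gp (column 12); column 11 is tempo
model_train_target <- model_data[train_idx, "popularity_gp"]
model_test_target <- model_data[-train_idx, "popularity_gp"]

# Rule of thumb: start with k close to the square root of the number of rows
sqrt(nrow(model_data))
## [1] 181.1988

# Fit k-NN and build the confusion matrix.
# k = 181 (an odd value near the square root above) is an assumed choice.
library(class)
knn_pred <- knn(train = model_train, test = model_test, cl = model_train_target, k = 181)
cm <- table(model_test_target, knn_pred)
sum(diag(cm))/length(model_test_target)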
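The square-root rule only gives a starting point for k. As a hedged sketch, a few candidate values can be compared on held-out data. The example below uses R's built-in iris data rather than the Spotify data, and the candidate values of k are arbitrary; it relies on knn() from the class package, which is among the recommended packages shipped with R.

```r
library(class)  # provides knn()

set.seed(42)

# Min-max scaling, as in the normalize() function used for the Spotify features
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features <- as.data.frame(lapply(iris[, 1:4], normalize))
target   <- iris$Species

# 80/20 split into training and held-out rows
idx <- sample(nrow(features), 0.8 * nrow(features))

# Held-out accuracy for a few candidate values of k (chosen arbitrarily)
candidate_k <- c(1, 5, 11, 21)
acc <- sapply(candidate_k, function(k) {
  pred <- knn(train = features[idx, ], test = features[-idx, ],
              cl = target[idx], k = k)
  mean(pred == target[-idx])
})
names(acc) <- candidate_k
print(acc)
```

The value of k with the highest held-out accuracy would then be a reasonable candidate; a separate validation set or cross-validation gives a less optimistic estimate than reusing the test set.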

Implications

We consider this project helpful for artists who want to understand what their audience is looking for and to improve the popularity of their tracks. It is also meant to help music distributors streamline their music libraries. Artists can use the observations from our analysis to improve the popularity of their songs: shorter songs and highly danceable songs have a better chance of gaining popularity. Even the title of a song might affect its popularity. Artists can try including common words such as “Love” and “Like”, which we found in most of the popular song titles; those words may help songs to be featured in popular playlists. Music distributors could focus more on the genres that are popular among the current generation of Spotify users. The R&B genre also appears to have gained popularity over the years, so distributors could collaborate with R&B artists on more work. Given the popularity of danceable songs, more playlists of danceable songs could be added as well. Users can make use of our Song Recommendation Engine to get recommendations that match their preferences.

Limitations

Even though Spotify features over 50 million songs, we performed our analysis on a dataset of around 32k records. Using a dynamic dataset could improve the results of the analysis. Additional attributes, such as the number of times a particular song has been played or the most downloaded playlists, could also help the analysis. The dataset does not include any demographic attributes. The popularity of songs can be affected by the demographics of the listeners, since people in different countries might have different music tastes; demographic data could provide more insights. We tried a multinomial logistic regression model and k-NN here; a clustering or neural network analysis could also be tried to develop a better model. We did not account for multicollinearity among the variables while developing the model, as the correlations are not that high. With a much larger dataset that shows considerable collinearity between variables, we could take the multicollinearity effect into account and try to remove it while building the model.